ZIB is a partner in the DFG-funded Collaborative Research Center “FONDA – Foundations of Workflows for Large-Scale Scientific Data Analysis” (CRC 1404), led by Humboldt-Universität zu Berlin. FONDA aims to enhance human productivity in data analysis workflows (DAWs). By now, almost all scientific disciplines use DAWs to process ever-increasing amounts of data. However, the portability, availability, and dependability of scientific workflows are limited, and means to increase them are needed. Improving productivity matters not only in computer science but across many fields. Consequently, many FONDA subprojects bring together computer scientists and experts from the natural sciences. This interdisciplinary approach fosters a deeper understanding of real-world workflows and ensures practical applicability.

Figure 1

We participate in subproject B4, which focuses on optimizing the execution of DAWs. By treating workflow tasks as black boxes, we developed a generic optimization architecture (Figure 1). By monitoring task executions, we create models that represent each task’s requirements as mathematical functions. These models enable rapid testing of different resource allocations, orders of magnitude faster than state-of-the-art workflow simulations based on discrete event simulation. Our ‘tests’ perform bottleneck analysis by predicting not only the makespan of tasks but also their actual resource needs and the dependencies affecting execution (Figure 2). Chains of tasks can be analyzed as well by using one task’s output as the input of another, allowing comprehensive bottleneck analysis of entire workflow executions. Based on this fast bottleneck analysis, heuristic algorithms can optimize resource allocation before and during execution. To facilitate just-in-time optimization, we determine the current progress of the workflow and of its executing tasks from live monitoring data.
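The idea of a model-based bottleneck test can be illustrated with a minimal sketch. It is not the B4 implementation: it assumes that each task is described only by a total demand per resource, that all resources are consumed concurrently at constant rates, and that chained tasks run strictly one after another. The names `TaskModel`, `analyze`, and `analyze_chain` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TaskModel:
    """Hypothetical black-box task model: total demand per resource
    (e.g. CPU-seconds of work, bytes to read)."""
    name: str
    demand: dict[str, float]


def analyze(task: TaskModel, allocation: dict[str, float]) -> tuple[float, str]:
    """Return (makespan, bottleneck) for one task.

    Assumption: every resource is used concurrently at its allocated rate,
    so the slowest resource determines the makespan.
    """
    times = {res: amount / allocation[res] for res, amount in task.demand.items()}
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


def analyze_chain(tasks: list[TaskModel], allocation: dict[str, float]):
    """Analyze a chain of tasks, assumed here to run sequentially.

    Returns the total makespan and the per-task bottleneck report.
    """
    report = [(t.name, *analyze(t, allocation)) for t in tasks]
    total = sum(makespan for _, makespan, _ in report)
    return total, report


# Illustrative numbers only; they are not measurements.
align = TaskModel("align", {"cpu": 120.0, "read_bytes": 4e9})
sort = TaskModel("sort", {"cpu": 20.0, "read_bytes": 2e9})

few_cores = {"cpu": 4.0, "read_bytes": 200e6}   # 4 cores, 200 MB/s
more_cores = {"cpu": 8.0, "read_bytes": 200e6}  # 8 cores, 200 MB/s

print(analyze(align, few_cores))
print(analyze(align, more_cores))
print(analyze_chain([align, sort], more_cores))
```

Because each test is only a handful of arithmetic operations, a heuristic can evaluate many candidate allocations cheaply. In this toy setting, doubling the cores for `align` moves its bottleneck from the CPU to reading, which indicates that further cores would not shorten the task.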

To address our monitoring requirements, we explored various methods and developed a custom solution for efficient yet comprehensive I/O monitoring. We found that it is challenging to track low-level metrics such as I/O requests and to correlate them with higher-level concepts such as workflow tasks. To predict the performance of I/O-heavy workloads appropriately, we developed I/O models that consider the caching behavior of the Linux kernel for different I/O methods. These models significantly improve prediction accuracy compared to event-based simulation. Our IOSIG plug-in for GCC supports pragma annotations in the source code that express the input/output (I/O) characteristics of certain I/O streams; based on these annotations, we decide during execution to which devices the I/O should best be redirected.
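Why caching matters for prediction can be seen in a deliberately simple sketch. It is not one of the I/O models described above: it assumes a single sequential read in which an already cached portion of the file is served at memory bandwidth and the remainder at device bandwidth. The function name and all parameters are hypothetical.

```python
def predicted_read_seconds(total_bytes: float, cached_bytes: float,
                           device_bw: float, memory_bw: float) -> float:
    """Estimate the duration of a sequential read of `total_bytes`.

    Assumption: `cached_bytes` are already in the page cache and are served
    at `memory_bw` (bytes/s); the rest comes from the device at `device_bw`.
    """
    if not 0 <= cached_bytes <= total_bytes:
        raise ValueError("cached_bytes must lie between 0 and total_bytes")
    if device_bw <= 0 or memory_bw <= 0:
        raise ValueError("bandwidths must be positive")
    uncached_bytes = total_bytes - cached_bytes
    return cached_bytes / memory_bw + uncached_bytes / device_bw


# Illustrative numbers only: 8 GB file, 200 MB/s device, 10 GB/s memory.
cold = predicted_read_seconds(8e9, 0.0, 200e6, 10e9)
warm = predicted_read_seconds(8e9, 2e9, 200e6, 10e9)
print(f"cold cache: {cold:.1f} s, partially cached: {warm:.1f} s")
```

Even this crude model shows that a prediction ignoring the cache state can be off by a large margin once part of the input has been read before, for example by a preceding task in the same workflow.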

Figure 2

Research for the first phase of FONDA officially began in July 2020 and concluded four years later. We presented and published our results at international scientific conferences and won a ‘best poster award’ at eScience’24. Masoud Jami, né Gholami, who contributed to these topics, received his PhD in computer science from Humboldt-Universität zu Berlin in 2024. We applied for a second phase of FONDA including our subproject, which is now funded following a thorough review by the DFG in February 2024. The second phase of FONDA started in July 2024 and will also span four years. In this phase, we will shift our focus from executing a single workflow within one data center to running workflows concurrently across multiple data centers. This new scenario introduces challenges such as the typically slower connections between data centers, which make data placement significantly more important. In addition, the concurrent execution of multiple workflows increases the complexity of the optimization approaches. While we can build on our work from the first phase, these new factors will require new solutions.

Joel Witzke, Florian Schintke